Statistical Machine Translation for Twitter

Authors

Laura Jehl

Miles Osborne

Abstract

We consider the problem of translating short messages (Tweets) using Europarl as a starting point. After highlighting some of the domain differences between Europarl and Twitter, we show that for German-English translation, we can improve performance from a baseline BLEU score of 25.58 to 53.45. By far the single most important improvement is passing through unknown words (which are mainly URLs). Enforcing the length constraint on translated output turns out to be relatively simple. Since our Twitter translation involves little reordering, we conclude that the biggest challenge is lexical: dealing with unknown words, spelling mistakes, creative orthography and Twitter idioms.
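
The two engineering steps named in the abstract, passing unknown words through and enforcing the length constraint, can be illustrated with a toy sketch. This is not the authors' system. The word-by-word `lexicon` lookup stands in for a real decoder. The names `translate_with_pass_through` and `pick_within_limit` are hypothetical. Choosing the first candidate in a ranked n-best list that fits is an assumed strategy, and the 140-character default reflects Twitter's limit at the time of publication.

```python
import re

# Tokens that look like URLs are copied verbatim (assumed pattern, for illustration).
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def translate_with_pass_through(tokens, lexicon):
    """Toy word-by-word translation that copies any token the lexicon lacks.

    `lexicon` is a hypothetical dict standing in for a phrase table; a real
    SMT decoder is far more involved. Unknown tokens (URLs, hashtags, names)
    are passed through unchanged instead of being dropped.
    """
    output = []
    for token in tokens:
        if URL_RE.match(token):
            output.append(token)  # never attempt to translate a URL
        else:
            # Fall back to the source token itself when there is no entry.
            output.append(lexicon.get(token.lower(), token))
    return output


def pick_within_limit(nbest, limit=140):
    """Return the highest-ranked candidate that fits within `limit` characters.

    `nbest` is assumed to be a non-empty list of candidate translations,
    best first. If no candidate fits, the top candidate is truncated.
    """
    for candidate in nbest:
        if len(candidate) <= limit:
            return candidate
    return nbest[0][:limit]


if __name__ == "__main__":
    toy_lexicon = {"ich": "i", "liebe": "love", "das": "this"}
    source = "Ich liebe das http://t.co/abc #berlin".split()
    print(" ".join(translate_with_pass_through(source, toy_lexicon)))
```

In this sketch the URL and the hashtag survive unchanged because neither has a lexicon entry, which is the behaviour the abstract credits with the largest gain.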

Similar resources

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the parts. Persian has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use the space character instead. This common incorrect use of the space character leads to some s...

Linguistic steganography on Twitter: hierarchical language modeling with manual interaction

This work proposes a natural language stegosystem for Twitter, modifying tweets as they are written to hide 4 bits of payload per tweet, which is a greater payload than previous systems have achieved. The system, CoverTweet, includes novel components, as well as some already developed in the literature. We believe that the task of transforming covers during embedding is equivalent to unilingual...

Unsupervised cleansing of noisy text

In this paper we look at the problem of cleansing noisy text using a statistical machine translation model. Noisy text is produced in informal communications such as Short Message Service (SMS), Twitter and chat. A typical Statistical Machine Translation system is trained on parallel text comprising noisy and clean sentences. In this paper we propose an unsupervised method for the translation o...

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines, as these are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

Translating Government Agencies' Tweet Feeds: Specificities, Problems and (a few) Solutions

While the automatic translation of tweets has already been investigated in different scenarios, we are not aware of any attempt to translate tweets created by government agencies. In this study, we report the experimental results we obtained when translating 12 Twitter feeds published by agencies and organizations of the government of Canada, using a state-of-the-art Statistical Machine Translat...

Publication date: 2013
